Anandi Gupta

Data Science II Problem Set 2

Question 1 - Explore the data

Part A - attribute types

The dataframe contains variables of three types: integers, floats, and objects. ID numbers; quantitative metrics such as the listing count, number of bedrooms or bathrooms, and host acceptance rate; and dummy variables indicating features such as whether a listing is instantly bookable or has availability are stored as integers or floats. Room type and neighborhood group are categorical variables stored as objects (each category is a string).
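A quick way to verify the attribute types is `df.dtypes`. The sketch below uses a toy stand-in for the listings dataframe; the column names are assumed from the assignment, and the real file has many more columns.

```python
import pandas as pd

# Toy stand-in for the listings dataframe (assumed column names;
# the real file has many more columns)
df = pd.DataFrame({
    "id": [1001, 1002, 1003],
    "bedrooms": [1.0, 2.0, None],
    "host_acceptance_rate": [0.95, 0.80, 0.99],
    "instant_bookable": [1, 0, 1],
    "room_type": ["Entire home/apt", "Private room", "Shared room"],
})

# dtypes reports each column's storage type (int64, float64, or object)
print(df.dtypes)
print(df.dtypes.value_counts())
```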

Part B - Dimensionality

The dataset has 50796 observations and 36 variables.
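These counts come straight from `df.shape`; a stand-in frame with the same dimensions illustrates the call (the real data is not reproduced here).

```python
import numpy as np
import pandas as pd

# Stand-in frame with the same dimensions reported above
df = pd.DataFrame(np.zeros((50796, 36)))

n_rows, n_cols = df.shape  # shape returns (observations, variables)
print(n_rows, n_cols)
```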

Part C - Missing Values

The dataset is relatively complete for most variables, but attributes such as the host acceptance rate, the security deposit and cleaning fee, and the review scores are more sparsely populated, with roughly 10,000-20,000 missing observations each.
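Per-column missing counts can be tallied with `isna().sum()`. The toy frame below mimics the sparsely populated columns described above (column names are assumed from the assignment):

```python
import pandas as pd

# Toy frame mimicking the sparsely populated columns (assumed names)
df = pd.DataFrame({
    "price": [100.0, 80.0, 120.0, 90.0],
    "cleaning_fee": [25.0, None, None, 40.0],
    "review_scores_rating": [95.0, None, 88.0, None],
})

# Count missing entries per column, largest first
missing = df.isna().sum().sort_values(ascending=False)
print(missing)
```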

Part D - Potential multicollinearity

Correlation between most pairs of attributes is fairly modest (between -0.5 and 0.5), so multicollinearity is limited overall. However, a few pairs, such as availability_30 and availability_60, beds and bedrooms, host listings count and host total listings count, and reviews per month and number of reviews ltm, are very highly or even perfectly correlated, indicating that multicollinearity is an issue for some variables (this list is not exhaustive).
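A correlation matrix from `df.corr()` is one way to surface such pairs. The sketch below uses synthetic data in which beds tracks bedrooms closely, standing in for the highly correlated pairs noted above; all values and column names are illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic features illustrating a highly correlated pair; names mirror
# the attributes discussed above but the data is made up
rng = np.random.default_rng(0)
n = 500
bedrooms = rng.integers(1, 5, size=n).astype(float)
beds = bedrooms + rng.normal(0, 0.3, size=n)       # near-duplicate of bedrooms
price = 50 * bedrooms + rng.normal(0, 60, size=n)  # only loosely related

df = pd.DataFrame({"bedrooms": bedrooms, "beds": beds, "price": price})
corr = df.corr()
print(corr.round(2))

# Flag off-diagonal pairs with absolute correlation above 0.9
high = (corr.abs() > 0.9) & (corr.abs() < 1.0)
print(high)
```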

Question 2 - Pre-Processing Pipeline

Note that dropping missing values reduces the number of observations substantially.
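The effect is easy to quantify by comparing row counts before and after `dropna()`. The frame below is a small illustrative example, not the real data:

```python
import pandas as pd

# Small illustrative frame with missing cleaning fees
df = pd.DataFrame({
    "price": [100.0, 80.0, 120.0, 90.0],
    "cleaning_fee": [25.0, None, None, 40.0],
})

before = len(df)
complete = df.dropna()  # keep only fully observed rows
after = len(complete)
print(before, after)  # half the rows are lost in this toy case
```

An alternative worth noting is imputation (e.g. filling each column with its median), which keeps every row at the cost of some bias.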

Question 3 - R2 scores

Question 4 - Run Linear and LASSO Models

As the results above show, increasing the LASSO penalty increases the number of coefficients shrunk to zero, from 5 (alpha = 0.5) to 23 (alpha = 5.0).
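That shrinkage pattern can be reproduced on synthetic data by counting zero coefficients at each penalty level. This is a sketch on a made-up regression problem, not the assignment's data, so the exact zero counts differ:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 30 predictors, only 10 of which drive the response
X, y = make_regression(n_samples=500, n_features=30, n_informative=10,
                       noise=10.0, random_state=0)
X = StandardScaler().fit_transform(X)  # LASSO is scale-sensitive

for alpha in (0.5, 5.0):
    lasso = Lasso(alpha=alpha).fit(X, y)
    n_zero = int(np.sum(lasso.coef_ == 0))
    print(f"alpha={alpha}: {n_zero} coefficients shrunk to zero")
```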

Question 5 - Compute scores for linear and LASSO models

We see that the LASSO model with alpha = 0.5 performs very similarly to the linear regression model, doing slightly better on both the R squared and adjusted R squared metrics. As alpha increases to 5.0, however, the LASSO model does worse on both metrics. This is likely because the regression includes correlated features: LASSO tends to keep one feature from a correlated group and shrink the coefficients of the rest to zero, and with a large penalty this discards useful information and worsens the fit.
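The comparison can be sketched with a small adjusted R squared helper, scored on a held-out set. The data here is synthetic and the `adjusted_r2` function is a hypothetical helper, so the scores are illustrative only:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

def adjusted_r2(r2, n, p):
    """Adjusted R^2 penalizes plain R^2 for the number of predictors p."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Synthetic data standing in for the listings features
X, y = make_regression(n_samples=1000, n_features=25, n_informative=8,
                       noise=15.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for name, model in [("linear", LinearRegression()),
                    ("lasso alpha=0.5", Lasso(alpha=0.5)),
                    ("lasso alpha=5.0", Lasso(alpha=5.0))]:
    model.fit(X_train, y_train)
    r2 = r2_score(y_test, model.predict(X_test))
    adj = adjusted_r2(r2, n=len(y_test), p=X_test.shape[1])
    print(f"{name}: R^2={r2:.3f}, adjusted R^2={adj:.3f}")
```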

Question 6 - Create and display a dataframe with the variable names, linear regression coefficients, and LASSO (alpha=5.0) coefficients as columns
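A minimal sketch of building such a dataframe follows, fitting both models on synthetic data with placeholder variable names (`x0`, `x1`, ...); the real run would use the listing feature names from the pre-processing pipeline.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic data with placeholder variable names (illustrative only)
X, y = make_regression(n_samples=300, n_features=5, noise=5.0, random_state=0)
feature_names = [f"x{i}" for i in range(X.shape[1])]

lin = LinearRegression().fit(X, y)
lasso = Lasso(alpha=5.0).fit(X, y)

# One row per variable, one column per model's coefficients
coef_df = pd.DataFrame({
    "variable": feature_names,
    "linear_coef": lin.coef_,
    "lasso_coef_alpha_5": lasso.coef_,
})
print(coef_df)
```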